43 research outputs found

    Statistical structures for internet-scale data management

    Get PDF
    Efficient query processing in traditional database management systems relies on statistics on base data. For centralized systems, there is a rich body of research results on such statistics, from simple aggregates to more elaborate synopses such as sketches and histograms. For Internet-scale distributed systems, on the other hand, statistics management still poses major challenges. With the work in this paper we aim to endow peer-to-peer data management over structured overlays with the power associated with such statistical information, with emphasis on meeting the scalability challenge. To this end, we first contribute efficient, accurate, and decentralized algorithms that can compute key aggregates such as Count, CountDistinct, Sum, and Average. We show how to construct several types of histograms, such as simple Equi-Width, Average-Shifted Equi-Width, and Equi-Depth histograms. We present a full-fledged open-source implementation of these tools for distributed statistical synopses, and report on a comprehensive experimental performance evaluation, evaluating our contributions in terms of efficiency, accuracy, and scalability

    Hybrid algorithms for subgraph pattern queries in graph databases

    Get PDF
    Numerous methods have been proposed over the years for subgraph query processing, as it is central to graph analytics. Existing work is fragmented into two major categories. Methods in the filter-then-verify (FTV) category first construct an index of the DB graphs. Given a query, the index is used to filter out graphs that cannot contain the query. On the remaining graphs, a subgraph isomorphism algorithm is applied to verify whether each graph indeed contains the query. A second category of algorithms is mainly concerned with optimizing the Subgraph Isomorphism (SI) testing process (an NP-Complete problem) in order to find all occurrences of the query within each DB graph, also known as the matching problem. The current research trend is to totally dismiss FTV methods, because SI methods have been shown to enjoy much shorter query execution times and because of the alleged high costs of managing the DB graph index in FTV methods. Thus, a number of new SI methods are being proposed annually. In the current work, we initially study the performance of the latest SI algorithms over datasets consisting of a large number of graphs. With our study, we evaluate the algorithms’ performance and we provide comparison details with former studies. As a second step, we combine the powerful filtering of a top-performing FTV method, with the various SI methods, which leads to the best practice conclusion that SI and FTV shouldn’t be thought of as disjoint types of solutions, as their union achieves better results than any one of them individually. Specifically, we experimentally analyze and quantify the (positive) impact of including the essence of indexed FTV methods within SI methods, showing that query processing times can be significantly improved at modest additional memory costs. We show that these results hold over a variety of well-known SI methods and across several real and synthetic datasets. As such, hybrids of the type reveal a missing opportunity and a blind spot in related literature and trends

    Rank join queries in NoSQL databases

    Get PDF
    Rank (i.e., top-k) join queries play a key role in modern analytics tasks. However, despite their importance and unlike centralized settings, they have been completely overlooked in cloud NoSQL settings. We attempt to fill this gap: We contribute a suite of solutions and study their performance comprehensively. Baseline solutions are offered using SQLlike languages (like Hive and Pig), based on MapReduce jobs. We first provide solutions that are based on specialized indices, which may themselves be accessed using either MapReduce or coordinator-based strategies. The first index-based solution is based on inverted indices, which are accessed with MapReduce jobs. The second index-based solution adapts a popular centralized rank-join algorithm. We further contribute a novel statistical structure comprising histograms and Bloomlters, which forms the basis for the third index-based solution. We provide (i) MapReduce algorithms showing how to build these indices and statistical structures, (ii) algorithms to allow for online updates to these indices, and (iii) query processing algorithms utilizing them. We implemented all algorithms in Hadoop (HDFS) and HBase and tested them on TPC-H datasets of various scales, utilizing dierent queries on tables of various sizes and different score-attribute distributions. We ported our implementations to Amazon EC2 and in-house lab clusters of various scales. We provide performance results for three metrics: query execution time, network bandwidth consumption, and dollar-cost for query execution. © 2014 VLDB Endowment

    AESOP: Altruism-endowed self organizing peers

    No full text

    Statistical Structures for Internet-Scale Query Optimization

    No full text

    Sizeof() in P2P Networks: Hash Sketches over DHTs

    No full text

    SeAl: managing accesses and data in peer-to-peer data sharing networks

    No full text
    We present SeAl, a novel data/resource and data-access management infrastructure designed for the purpose of addressing a key problem in P2P data sharing networks, namely the problem of wide-scale selfish peer behavior. Selfish behavior has been manifested and well documented and it is widely accepted that unless this is dealt with, the scalability, efficiency, and the usefulness of P2P sharing networks will be diminished. SeAl essentially consists of a monitoring/accounting subsystem, an auditing/verification subsystem, and incentive mechanisms. The monitoring subsystem facilitates the classification of peers into selfish/altruistic. The auditing/verification layer provides a shield against perjurer/slandering and colluding peers that may try to cheat the monitoring subsystem. The incentives mechanisms effectively utilize these layers so to increase the computational/networking and data resources that are available to the community. Our extensive performance results show that SeAl performs its tasks swiftly, while the overhead introduced by our accounting and auditing mechanisms in terms of response time, network, and storage overheads are very small

    The RangeGuard: Range Query Optimization in Peer-to-Peer Data Networks

    No full text

    Towards hybrid methods for graph pattern queries

    No full text
    In the subgraph querying problem, given a query graph and a dataset of graphs, we want to locate which graphs in the dataset contain the query and/or find all its occurrences. Over the years, numerous methods, fragmented in 2 categories, were proposed to tackle the problem. In the first category, methods follow the filter-then-verify (FTV) paradigm where an index is used to filter out graphs that definitely do not contain the query as an answer. On the remaining set of graphs, a subgraph isomorphism algorithm is applied to verify the graphs that contain the query graph. A second category, so called no-filter, verify (NFV), invested in optimizing the subgraph isomorphism process, by employing a lightweight index primarily to locate candidate vertices on the graphs in the dataset. The current trend is to totally dismiss the FTV methods and employ NFV methods instead. With this work we wish to point out that a hybrid solution that is, a combination of the filtering of some FTV method with the subgraph isomorphism test of top-performing NFV methods, can be highly competitive in terms of performance vis-a-vis methods from either the FTV or the NFV category. We will present some preliminary results, based on experiments, where such a combination proves to be beneficial by primarily avoiding the initiation of a large number of redundant subgraph isomorphism tests. © 2017, Copyright is with the authors
    corecore